AITopics | nyi divergence

Comparing two probability distributions is a basic building block of statistics and machine learning, and the right family is well understood: the Rényi divergences of order $α\in[0,\infty]$ are the unique family monotone under data processing and additive on independent products. Many problems instead compare more than two distributions at once -- multi-population fairness, multi-prior PAC-Bayes bounds, multi-hypothesis testing -- and the right multi-distribution generalization of the Rényi family has been an open question. We characterize it. Every functional of $W$-tuples of distributions that is monotone under data processing and additive on independent products is a positive integral of multi-way coincidence divergences $C_α(π_1,\dots,π_W) := -\log\int π_1^{α_1}\cdotsπ_W^{α_W}$ (with $\sum_k α_k = 1$) over a parameter space with four strata: the simplex interior; mixed-sign exponent cones (the analogue of Rényi orders $>1$); a tropical boundary at infinity carrying max-divergences; and pairwise Kullback-Leibler edges at the simplex vertices. Each stratum is necessary -- the destination of an explicit data-processing-monotone, product-additive divergence the others cannot reproduce -- and each is a clean limit of simplex-interior atoms. The same family arises from several independent routes -- the structural axioms, Kolmogorov-Nagumo means with Rényi's entropy axiomatics, classical entropy characterizations, multi-hypothesis testing error exponents, and a multi-lottery betting interpretation -- structural evidence that this is the canonical multi-distribution Rényi calculus rather than an artefact of any one axiomatic input. The two-prior case recovers the standard Rényi result; a worked $W=3$ instance, numerical verification, and a conditional extension round out the treatment.

artificial intelligence, divergence, machine learning, (18 more...)

arXiv.org Machine Learning

2606.27349

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Information from coincidences

Balsubramani, Akshay

arXiv.org Machine LearningJun-25-2026

We prove a single algebraic mixed coincidence identity that unifies a broad swath of information-theoretic variational results. For any family of priors $\{π_i\}$ and real exponents $\{ α_i \}$, the log of the mixed count $E_{x\simν}\!\left[\prod_{i=1}^W π_i^{α_i}(x)\right]$ is simultaneously a Boltzmann coincidence weight, an exponential-family normalizer, a maximum-entropy value, and a KL-barycenter optimum. The identity yields a unified derivation of classical cornerstones of information theory: concentration of empirical distributions (Sanov-type decompositions and Gibbs conditioning), hypothesis-testing error exponents (Chernoff information and its multi-way analogue), change-of-measure inequalities (Donsker-Varadhan and PAC-Bayes), and laws governing rare-pattern coincidences (Erdos-Renyi run-length, iterative guesswork, rate-distortion, and birthday thresholds). Each is recovered as a specialization of the same algebraic equality. It strictly generalizes the classical Renyi entropy and divergence variational formulas (one and two priors respectively) to a $W$-prior simplex, and holds for unnormalized and continuum-indexed priors. Among its consequences are an exact multi-prior PAC-Bayes penalty that subtracts an explicit "coincidence bonus" from the usual single-prior posterior penalty, and the asymptotic MAP error exponent for $W$-ary hypothesis testing as an edge-restricted simplex optimum. We demonstrate the calculus at scale on two large alphabets encoding richly modeled sequential languages: on language-model next-token predictives where we recover contrastive decoding, and on human genomic regulatory sequence where it separates correlated from diverse prior families along a sliding-window trace.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2606.25042

Country: Europe (0.45)

Genre: Research Report > New Finding (0.45)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Leisure & Entertainment (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.45)

Add feedback

On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

Mustafi, Aratrika, Mukherjee, Soumya

arXiv.org Machine LearningMay-25-2026

We consider the problem of sampling from an unnormalized Boltzmann/ Gibbs density, π(θ) exp V(θ),θ Θ Rd, where the normalization constant is unknown (and/or intractable) and only the potential function V (and typically its derivatives) can be evaluated. This problem arises across various domains in Bayesian inference, statistical physics, and modern machine learning. A common variational perspective on sampling is to characterize the target distribution π as the unique minimizer of a functional (typically a divergence functional) over the space of probability measures. From this viewpoint, sampling can be formulated as evolving an initial distribution ρ0 toward π via the gradient flow of this functional under a suitable geometric structure on the space of probability measures. In this paper, we focus on a gradient flow based sampling methodology built from the spherical Hellinger Kantorovich (SHK), also known as the Wasserstein Fisher Rao (WFR), geometry on the space of probability measures (Kondratyev and Vorotnikov, 2019; Liero et al., 2018; Chizat et al., 2015). When the variational objective is the exclusive KL divergence ρ 7 KL(ρ π), the SHK gradient flow generates a time-indexed family of marginals {ρt}t 0 (initialized at ρ0 P2(Θ)) that evolves according to the continuity reaction equation (4). This evolution is equivalent to the birth-death Langevin dynamics introduced in Lu et al. (2019) .

artificial intelligence, exp, machine learning, (20 more...)

arXiv.org Machine Learning

2605.23879

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)

Add feedback

2a7157c84dcf263f77b37d6c11d7d149-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 05:47:35 GMT

Add feedback

18561617ca0b4ffa293166b3186e04b0-Paper-Conference.pdf

Neural Information Processing SystemsApr-24-2026, 20:34:13 GMT

artificial intelligence, machine learning, privacy amplification, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

04b42392f9a3a16aea012395359b8148-Supplemental-Conference.pdf

Neural Information Processing SystemsApr-24-2026, 08:56:09 GMT

artificial intelligence, equation, machine learning, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback

Tail-Aware Information-Theoretic Generalization for RLHF and SGLD

Zhang, Huiming, Li, Binghan, Tian, Wan, Sun, Qiang

arXiv.org Machine LearningApr-14-2026

Classical information-theoretic generalization bounds typically control the generalization gap through KL-based mutual information and therefore rely on boundedness or sub-Gaussian tails via the moment generating function (MGF). In many modern pipelines, such as robust learning, RLHF, and stochastic optimization, losses and rewards can be heavy-tailed, and MGFs may not exist, rendering KL-based tools ineffective. We develop a tail-dependent information-theoretic framework for sub-Weibull data, where the tail parameter $θ$ controls the tail heaviness: $θ=2$ corresponds to sub-Gaussian, $θ=1$ to sub-exponential, and $0<θ<1$ to genuinely heavy tails. Our key technical ingredient is a decorrelation lemma that bounds change-of-measure expectations using a shifted-log $f_θ$-divergence, which admits explicit comparisons to Rényi divergence without MGF arguments. On the empirical-process side, we establish sharp maximal inequalities and a Dudley-type chaining bound for sub-Weibull processes with tail index $θ$, with complexity scaling as $\log^{1/θ}$ and entropy$^{1/θ}$. These tools yield expected and high-probability PAC-Bayes generalization bounds, as well as an information-theoretic chaining inequality based on multiscale Rényi mutual information. We illustrate the consequences in Rényi-regularized RLHF under heavy-tailed rewards and in stochastic gradient Langevin dynamics with heavy-tailed gradient noise.

artificial intelligence, exp, machine learning, (18 more...)

arXiv.org Machine Learning

2604.10727

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Algorithmic warm starts for Hamiltonian Monte Carlo

Zhang, Matthew S., Altschuler, Jason M., Chewi, Sinho

arXiv.org Machine LearningMar-25-2026

Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension $d$. On one hand, a variety of results show that Metropolized HMC converges in $O(d^{1/4})$ iterations from a warm start close to stationarity. On the other hand, Metropolized HMC is significantly slower without a warm start, e.g., requiring $Ω(d^{1/2})$ iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that \emph{non-Metropolized} HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, after which we can exploit the warm start using Metropolized HMC. Our final complexity of $\tilde{O}(d^{1/4})$ is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of $\tilde{O}(d^{1/2})$. This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations.

artificial intelligence, inequality, machine learning, (16 more...)

arXiv.org Machine Learning

2603.22741

Country: